\gdef\pagstrt{T}		% pages start numbering now
\chapheading{\vbox{\ctrline{CURRENT}
		   \ctrline{STEREO MAPPING SYSTEMS}}}
\setchaptitle{Stereo Systems}
 

This treatment of automated stereo systems is divided into two sections,
one dealing with existing commercial systems and the other with research
systems.
The basic problem addressed in both of these is the same: to determine the
distance to various locations in a scene by selecting corresponding
points in a stereo pair of images and doing the simple geometric
triangulation; the difficulty lies in
selecting corresponding points.
Commercial systems have been developed to automate
terrain mapping; research systems aim for this and, further, to assign
symbolic interpretations to the scene components.
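For a rectified stereo pair with parallel camera axes, the triangulation mentioned above reduces to a one-line relation between image disparity and depth. The following sketch (in Python) makes the geometry concrete; the focal length, baseline, and pixel coordinates are illustrative values, not parameters of any system described here.

```python
# Minimal sketch of the triangulation step for a rectified stereo pair
# (parallel optical axes, image rows aligned with the baseline).
# All numeric values below are illustrative.

def depth_from_disparity(x_left, x_right, focal_length_px, baseline_m):
    """Return the distance (metres) to a scene point imaged at column
    x_left in the left view and x_right in the right view."""
    disparity = x_left - x_right        # pixels; positive for points in front of the cameras
    if disparity <= 0:
        raise ValueError("non-positive disparity: no finite depth")
    return focal_length_px * baseline_m / disparity

# Example: a 1000-pixel focal length, 0.5 m baseline, and 8 pixels of
# disparity place the point 62.5 m from the cameras.
print(depth_from_disparity(520.0, 512.0, 1000.0, 0.5))
```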



\headsection{Commercial Stereo Systems}

The development of commercial stereo systems has been driven by
requirements of terrestrial cartography.  Cartography traditionally
concentrated on mapping of terrain, producing elevation contour maps and
digital terrain data bases.  The bulk of stereophotogrammetry is
accomplished by humans who perform the stereo correlation themselves or who
assist interactive systems whose automated stereo correlation requires
extensive operator intervention. All of these systems use cross-correlation.
To the extent that cross-correlation is limited, they all have similar limitations.
These limitations are in mapping accuracy and applicability to various terrain
types (see, for example, Ryan, Gray, Hunt [\ryana] and Ryan and Hunt [\ryanb]).

\sheadsection{Overview}

The first generation of interactive, partially-automated stereo compilation
systems uses analog cross-correlation of small image areas (more recently,
digital correlation) to track corresponding areas in image pairs.  All of
the systems to be described require distinctive texture within the area of correlation;
they break track on sand, concrete, snow, or roofs (ambiguous texture or
featureless area); they break track in trees (ill-defined depth).  The
systems break track where there are occlusions, hence no corresponding
areas in two images, that is, at buildings and thin objects (poles) where
the correlation area crosses surface discontinuities.  Thus, the systems
break track where there is no local correlation (zero signal and where two
images do not correspond) or where the correlation is ambiguous (where the
signal is repetitive).  The systems must be started manually and corrected
when they break track.

U. V. Helava invented the analytical plotter [\helava, \friedman p.\ 703]
while working at the National Research Council of Canada.  The plotter
utilized computers and special servo-mechanisms to move the photographic
plates in a manner that compensated for central projection and other
factors.  Construction of the first analytical plotter, the AP-1, was a
collaborative effort [\friedman, p.\ 703].  Ottico Meccanica Italiana
(OMI) built the optical-mechanical components, Bendix Research
Laboratories had responsibility for the computer, electronic interface,
and computer programs, the National Research Council of Canada consulted
on the development, and the U.S. Rome Air Development Center provided the
financial support through a contract for development.  First delivery was
made in 1961.  A succession of refinements followed [Scarano, private
communication].  The AS-11A, introduced in 1962, exhibited 5-micron
accuracy.  Automated correlation was incorporated into an experimental
AS-11A in 1965.  Today's state-of-the-art instrument, the AS-11B-X
Automated Stereomapper [\scarano; and private communication],
incorporates laser beam epipolar scanning, address modification to
implement the digital equivalent of scan shaping, and digital correlation
to extract elevation data.

In 1960, Ramo-Wooldridge was awarded a contract to develop for the U.S.
Army Engineering Geodesy, Intelligence Mapping Research and Development
Agency a prototype system for the automatic production of altitude data
and orthophotos from a stereo model as projected on a Kelsh plotter.
This effort resulted in the production by Thompson Ramo
Wooldridge of the Automatic Stereo Mapping System [\bertrama], and
ultimately by Bunker-Ramo Corporation of the Universal Automatic Map
Compilation Equipment (UNAMACE) [\bertramb, \bertramc].  These systems
incorporated digital computers to position, electronically scan, and
automatically estimate by correlation corresponding points in a stereo pair
of diapositives.  A digital phase sensitive detection technique accomplishes
locking in and tracking over the 3-D model.  A prototype version of this
automated stereo mapper spun a Nipkow disc (the original TV scanner
consisting of a disc with many small apertures arranged in a spiral) over a
rectangular window in the plane of the stereo model on the Kelsh.
Computer controlled tilt of the disc accomplished relative stereo window
warping to accommodate terrain tilt.  In later versions the Nipkow disk was
abandoned, partly because it contributed an unacceptable degree of
vignetting.  Thereafter, window warping was not a feature of the system.

During this same period, Gestalt International Limited produced the GPM-1
and GPM-2 Gestalt Photomapping Systems [\kelly, \allam].  The GPM-2
computer-controlled, auto-correlating, analytical photomapper represents a
high level in the state-of-the-art in automated mapping.  The instrument
is designed to operate upon hard copy input, although it could presumably
be redesigned to operate from a digital data base.  Scanning is
accomplished by flying spot scanners, reducing optics that focus the spots
onto the input diapositives mounted on transporter stages, and
photomultiplier recorders.  The system would appear to be well suited to
typical natural terrain analysis, but encounters the same problems
experienced by all related area cross-correlation systems when confronted
with steep and overhanging relief, such as that frequently encountered in
dealing with common cultural artifacts.

Steve Mildenberger has kindly called to our attention that during the
period 1972-76 information was collected by the International Society of
Photogrammetry on the quality of products generated by commercially
available orthophoto systems [\blachut].  Galileo, Gestalt, Kelsh,
Ortho, Wild, and Zeiss instruments were involved in the test.  As the test
was of orthophoto production capability rather than of automatic
correlation, UNAMACE and AS-11 equipment was not included.  Participating
manufacturers were provided with a stereo pair of 1:10,000 contact
diapositives.  A photograph of the test area selected for the experiment
shows rolling farmland, a portion of which is covered by dense forest,
some isolated trees, what appears to be a dammed section of a river, a
rural road structure, and a few small buildings.  It was noted [Ibid.\ p.\ 4]
that whereas ``many open hills in the test area are particularly suited
for evaluating the accuracy of the planimetric and height information
provided by the orthophoto technique ... [the] buildings and isolated
groups of trees in the area offer a possibility of evaluating the
influence of these `elevated' features on the geometric quality of
orthophotos and contour lines, particularly when using automatic image
correlation.''

The manufacturers were asked to produce the orthophotos and the height
data with their own equipment.  Work was evaluated by the National
Research Council of Canada.  The only systems incorporating automated
image correlation were the Gestalt International Limited GPM-1 and the
GPM-2 (the latter being an improved version of the former).  As is to be
expected, the automatic correlation of the Gestalt systems yielded
deteriorated elevation and rectification data in the vicinity of elevated
features such as trees and buildings.  It was concluded that, at this scale,
point elevations within 10 meters ground distance (1 mm on the original
photos) of such features should not be used in practical projects.  Omitting errors
generated within 10 meters of buildings and trees, the accuracy of the
GPM-2 was judged comparable with that of non-automated correlation
equipment.  Its speed of performance in preparation of the control
manuscript and the relative and absolute control procedures was three
times faster than that for conventional equipment (37 minutes vs.\ an
average of 1.5 hours for the nine or so pieces of equipment involved in
the testing).  The GPM-2 is suggested to run approximately 50 percent
faster in the production of orthophotos, along with recorded and plotted
terrain heights, than do non-automated correlation systems.  It is
remarked at the end of the report that stereo orthophotos submitted by
Gestalt International Limited arrived too late to be evaluated and that
analysis of them would be carried out in a separate study.  We have not
seen the subsequent study, but presume that the results of further
analysis would not appreciably influence the general conclusions regarding
performance of the automatic correlation capability reported here.

\headsection{Research Stereo Systems}

	The objective of research stereo systems is the development of accurate,
robust, totally autonomous three-dimensional mapping for input to object or terrain
modelling systems. The principal distinction between these and commercial systems
is in their exploitation of both geometric and photometric constraints for their
analyses. Such systems run on large digital computers with considerable amounts of
memory.  The large memory allows access to image data at rates limited by
electronics rather than mechanics. Some use the semantics of two-dimensional
structural entities in the scene to guide the search for correspondences and
restrict the set of depth determinations --- requiring global consistency.
Being experimental, they tend to work on much smaller images than the commercial
systems described earlier.
None as yet runs in real time, although coming generations of
parallel mechanisms are expected to provide the computational power to enable this.

\sheadsection{Overview}

Given a pair of corresponding points in two views of a scene,
when the relative position and orientation of the two imaging sites are
known, the depth to the viewed spot can be obtained
through a simple triangulation.  The real problem in
stereo analysis is in determining those pairs of corresponding points.
Two main categories of approaches exist in this stereo correspondence work, and the
difference centers on their use of {\it feature-based} or {\it area-based}
correspondences. The {\it feature-based} systems to be described are those
of [\arnold], [\grimson], [\bakerthesis], and [\arnoldthesis], and the
{\it area-based} systems are those of [\gimel], [\levine], [\mori], [\hannah],
[\panton], [\gennery], and [\moravec].

The distinction between `feature' and `area' correspondence here can be more a
matter of degree than type.  {\it Feature-based} analysis involves the
transformation of the sensed data from a discrete two-dimensional
intensity array to a more symbolic form, such as significant intensity contours,
or `edges' (features). It is the properties of these features which then provide
the metric for the correspondence.  `Feature' is a fairly general term, but its
use here may be equated with `edge'.  There are many fewer `edges' than
image elements in a view of a scene, so this transformation, generally, reduces
the computational cost of determining correspondences.  A drawback is that not every
point in an image is a `feature', so the result of a solely feature-oriented correlation
will not be the dense depth map one may want.

In {\it area-based} analysis two-dimensional windowing operators measure
the similarity in intensity pattern between local areas, or windows, in the two
images. Cross-correlation is used to determine matches between windows in
one image with windows in the other. Normalized cross-correlation has the
ability to compensate for contrast and brightness differences across images.
If lighting and sensor/processing conditions are known, this flexibility
in the algorithm may not be required. In this case
other correlation forms (such as normalized RMS, or absolute difference) may be used.
Area cross-correlation is often
not applied to every pixel in the image arrays, but selectively for those
whose local variance is high.  Variance measures have been used as filters
to limit possible correspondences, with correlation being used to select
the best from among the candidates. These variations may qualify such approaches
as `feature-based', although they will not be considered so here. Perhaps a better
way of categorizing these systems is as {\it feature-driven
area-based} ([\moravec] uses an `interest operator' to select
worthy points in a reference image, [\hendersona]
preprocesses the data to find edges which are then used
to bound the area-based search, [\levine] limits
initial correlation to areas having local maximal variance, and [\gennery]
uses a variance-based $F$ test to filter out areas of minimal information, and
therefore minimal interest).
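As a concrete illustration of the window comparison described above, the following sketch computes normalized cross-correlation between two equally sized windows and performs a brute-force search along one image row. The NumPy-based formulation, window shapes, and search strategy are assumptions of the example, not a transcription of any particular system.

```python
import numpy as np

def normalized_cross_correlation(window_a, window_b):
    """Normalized cross-correlation of two equally sized image windows.
    Subtracting the means and dividing by the norms makes the score
    insensitive to brightness and contrast differences between the images."""
    a = np.asarray(window_a, dtype=float)
    b = np.asarray(window_b, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    if denom == 0.0:                    # featureless window: correlation undefined
        return 0.0
    return float((a * b).sum() / denom)

def best_match_along_row(reference, search_strip, half_width):
    """Slide the reference window along a horizontal strip of the other
    image and return the centre column whose window correlates best
    (a brute-force search over every candidate column)."""
    strip = np.asarray(search_strip, dtype=float)
    scores = [normalized_cross_correlation(
                  reference, strip[:, c - half_width:c + half_width + 1])
              for c in range(half_width, strip.shape[1] - half_width)]
    return int(np.argmax(scores)) + half_width
```

Other similarity measures mentioned in the text (normalized RMS, absolute difference) would replace only the scoring function; the surrounding search would be unchanged.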

\sheadsection{Area-based processing}
Area-based correspondence
has been applied quite successfully to the stereo analysis of rolling
terrain, but it degrades when the scene is not smoothly
varying and continuous.  In images of such domains many windows to be
matched will have no correspondences in the other image (for example,
those windows lying on surfaces which are occluded from the other imaging
position). The chief difficulty with the area-based approach is in
properly matching window shapes and sizes for conjugate image areas $\ldots$
taking into account both variation in terrain slope and discontinuities at
surface boundaries.

Large correlation window sizes are required in attaining statistical
significance in the sampling, yet the characteristics measured over the
windows become less and less representative of the observed local surface
as this window size increases.  Discontinuities in the surface can cause a
positioned window in one image to be sampling local intensity values from
more than one intensity surface in the other image, and a correct
cross-correlation would only be possible if the window could be
partitioned and matched with (possibly several) windows of various size
and shape in the other image.  Such adaptation requires more flexibility
than area-based correspondence has thus far been shown to provide. Abrupt
discontinuities in topographic structure and an abundance of occlusions
characterize urban or cultural areas. It is at precisely these points of
depth discontinuity that we want to obtain accurate
surface position measurements.  This would suggest that current area-based
processing is inappropriate for domains with occlusions and abrupt depth
discontinuities.

Some consideration has been given to this window shaping problem in area-based work.
Levine and O'Handley ([\levine]) and Mori, Kidode and Asada ([\mori]) vary their
correlation window sizes with the local intensity variance.
They presume that high variance
implies high local texture and thus suggests the need for smaller
correlation windows, while low variance suggests surface uniformity and
the need for larger sample sizes and larger correlation windows.
[\panton] uses trapezoidal window shapes in the search image, as
determined by previous and predicted correspondence results, to match the
rectangular windows of the reference image.  [\gennery] included a
partial solution to this problem for a specific camera geometry when
looking at windows presumed to lie in the ground plane.  Mori, Kidode and
Asada implemented an iterative technique that would compensate for
terrain variations by successive refinements to image registration
estimates.  Both [\levine] and [\hannah]
included in their algorithms techniques for identifying scene occlusions
and areas of image non-overlap, but these were entered as cases of
exception handling, and it is doubtful that they were adequate as models
of occlusion.
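A minimal sketch of the variance-driven window sizing just described appears below; the thresholds and window sizes are illustrative, not the published values of Levine and O'Handley or of Mori, Kidode and Asada.

```python
import numpy as np

def choose_window_size(patch, small=5, medium=9, large=15,
                       high_var=400.0, low_var=50.0):
    """Pick a correlation window size from the local intensity variance,
    in the spirit of the adaptive schemes described above: textured areas
    (high variance) get small windows, bland areas get large ones.
    The thresholds and sizes here are assumed values for illustration."""
    v = float(np.var(np.asarray(patch, dtype=float)))
    if v >= high_var:
        return small
    if v <= low_var:
        return large
    return medium
```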

A related problem with area-based correspondence is that increasing window
size improves statistical significance but generally results in
poorer 3-space positioning accuracy for the correspondence. Feature-based
analysis obtains more precise positioning (for its edges) in the individual
images, and it can attain correspondingly higher accuracy for its correspondences
in 3-space ([\arnold] indicates that edge-based techniques offer a factor of 10
improvement in accuracy over area-based correlation methods).

Area-based correspondence systems also tend to be prediction driven,
in that they process an image serially and at each step use the context of
previously matched neighbouring points to limit the search for the present
correspondence. No backtracking facility is provided with this technique, and only
[\gennery] includes a capacity for correcting locally determined miscorrespondences.
With little ability to either correct or detect errors,
such a prediction approach can lead to rapid degeneration once errors begin to occur.

A final, and important distinction between area-based and feature-based processing
lies in the basic philosophy of their analyses.
The underlying assumption of area-based correspondence is that it is the
photometric properties of a scene that are invariant to imaging position,
and the correlating of these properties will be sufficient to allow the proper
correspondences to be determined.  But it is not the measurable photometric properties
that are invariant to viewpoint positioning. In the degenerate, although common
enough case, a surface of a certain intensity seen unobscured from one
viewpoint will not even be visible from another slightly different
viewpoint.  All that can be said to be {\it truly} invariant to viewpoint
positioning is the three-dimensional structure of the scene itself.
A better metric for the correlation would be one which deals in some way
with that scene three-dimensional structure.


\sheadsection{Feature-based processing}

Feature-based analysis, in the form of edge analysis, comes closer to
dealing with this scene structure invariance.
It works generally with the premise that a local measure on the intensity
function is representative of physical change in the underlying
scene.  The local measure on the intensity function could be,
for example, a maximum in intensity gradient --- peak in the first difference
of intensity, zero-crossing in the second difference.
Physical change in the scene could be a break in depth continuity and accompanying
projected surface
reflectance or luminance change, or a change in surface intensity from a
surface detail without topographic break. The point to notice is that feature-based
analysis uses the semantics of intensity variation in its attempt to extract
measures of the physical change in the underlying structure of the projected views,
and uses these two-dimensional observations to infer the three-dimensionality of
the scene.
The validity of this intensity edge tracking in a stereopsis system is apparent:
\vspsml
\bullit{a discontinuity in surface orientation will,
in general, give rise to a variation in incident reflection, which will appear to
an imaging source as a change in brightness -- tracking the intensity edge across
the two views will track the surface discontinuity;}
\vspsml
\bullit{an illumination discontinuity (shadow edge), although not likely corresponding to
a surface discontinuity (the shadow will lie {\it on} the surface), will be visible
as a brightness discontinuity -- tracking the shadow edge across the two views
will provide depth information about the shadow-bearing surface;}
\vspsml
\bullit{surface marking or pigment variation will similarly provide depth information
along the bearing surface.}
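As a one-dimensional illustration of the local measures mentioned above (a peak in the first difference of intensity, a zero-crossing in the second), the following sketch locates candidate edges along a single scanline; the noise threshold is an assumed value.

```python
import numpy as np

def second_difference_zero_crossings(scanline, min_slope=2.0):
    """Return approximate pixel indices where the second difference of a
    1-D intensity scanline changes sign, i.e. candidate edge locations.
    The min_slope threshold, which rejects crossings produced by noise,
    is an illustrative value."""
    f = np.asarray(scanline, dtype=float)
    second = np.diff(f, n=2)              # second[i] = f[i+2] - 2*f[i+1] + f[i]
    edges = []
    for i in range(len(second) - 1):
        sign_change = second[i] * second[i + 1] < 0
        if sign_change and abs(second[i] - second[i + 1]) >= min_slope:
            edges.append(i + 2)           # crossing lies near pixel i+2 of the scanline
    return edges
```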

\vspsml
Probably the most widely known edge-based stereo scheme to date is that of
Marr and Poggio ([\marrpoggiob]), as implemented in a computer program by
Grimson (see the following summary [\grimson]).
The algorithm has been fairly well tested on a reasonably wide variety of
images (random dot stereograms, natural terrain, urban scenes), and is
at present being implemented in hardware [\nishiharalarson].
The stereo processing system of Henderson, Miller and Grosch of the
Control Data Corporation research group ([\hendersona, \hendersonb]),
called the Automatic Planar Surface System, uses edges to guide its area-based
matching. They address their work specifically toward the problem of constructing
planar models of rectilinear cultural scenes from stereo pairs of aerial imagery.
[\arnold] developed an edge-based stereo correspondence system that used
local edge properties to select edge match possibilities, and
a weighted iteration process to resolve match conflicts.
Baker's approach ([\bakerthesis]) incorporates both edge and intensity based correspondences,
although unlike the CDC algorithm it actually correlates the edges. It uses
the relatively sparse results of its edge analysis as a template for a fuller
pixel intensity matching process. An extension of the CDC work ([\pantongrosch])
has led to a stereo correlation system that uses both the {\it local} edge information
(as the above systems) and {\it extended} edge information in its stereo matching.
Recent work by Arnold ([\arnoldthesis]) has resulted in a line-by-line edge-based
correspondence scheme that uses extended edge connectivity to disambiguate
match rivalries.  The line-by-line analysis retains the optimal as well
as several near-optimal alternate sets of correspondences.

\headsection{Stereo System Summary}

\sheadsection{Autonomous processing}
A stereo system to operate for mapping, reconnaissance, or inspection in some domain
must be able to initialize itself and run without the need of operator intervention.
Of the recent systems summarized, only Gennery's and Baker's are designed to
run entirely autonomously.  Panton's appears to require manual initialization, as
does certainly the Henderson, Miller, Grosch system ([\hendersona, \hendersonb]) and, to a
lesser extent, the Grimson system ([\grimson]).  These may also require manual
intervention during the processing --- the Henderson system when there are vertical breaks in
scene continuity, the Grimson system when the disparity differences exceed the size of the
largest convolution operator, and the Panton system when the terrain approaches
discontinuity and the correlator begins to diverge locally from the
correct matchings.

\sheadsection{Domain restrictions}
An understanding of its domain of intended use and an analysis of
its performance capabilities will give us insights into a stereo system's
overall range of application, and thus its utility.
In general, the performance of the area-based correspondence schemes
will degrade rapidly when confronted with scenes of discontinuous
structure, and this makes them inappropriate for the analysis of cultural sites.
The CDC techniques of [\hendersona, \hendersonb] and [\pantongrosch] exclude the
processing of rolling, curved, or even non-rectilinear structures --- predisposed
to the analysis of building tops, they are inappropriate for almost everything else.
None of the systems described can work well
where details in the background have reversed positioning
with respect to occluding surfaces lying before them (consider a finger at arm's
length and the background beyond) --- this is referred to as the {\it edge reversal}
problem. The Grimson work is the only
one which does not explicitly exclude such positional reversals between the two imaging
planes (although it probably does so implicitly in the working of its region
disparity consensus). Excluding edge reversals is such a convenient expedient when
working with epipolar geometries that it has been widely adopted in
correspondence processing.
That it is a restriction becomes obvious when it is noticed that it
prohibits the simultaneous fusion of a thin object (like a pole) and its
background --- relative to the pole, what is left-right in one image
will be right-left in the other.  This artifact of the processing
may be excused to some extent in that it is also observed in human
stereopsis, but there is no necessity to build
limitations of the human system into a machine system
(in their study of the limitations of binocular fusion,
Burt and Julesz ([\burtjulesz]) comment on the impossibility of
fusion of positionally reversed points).

Looking at the range of examples
presented in the published results from these stereo systems also provides insight
into their applicability.
[\panton] has demonstrated a single rolling terrain stereo pair analysis,
as has [\gennery], although Gennery's scene contains some rather large rocks
and the scene slopes off to a (not seen) horizon. Levine and O'Handley show the
processing of two rock-strewn scenes, similar to that of Gennery.  The views in
these area-based systems are, as expected,
of terrain, and depth discontinuities are either not severe or ignored.
Baker shows the results of his processing on two pairs
of images, one cultural (and synthetic), and one of rolling natural
terrain.
Grimson has applied his algorithm to considerably more scenes $\ldots$ many
random dot stereograms, and several real image pairs.
Arnold's system has been applied to the analysis of a single hand-extracted
edge description of a stereo pair ([\arnoldthesis]).

\sheadsection{Global consistency and monocular cues}
The human perceptual system has the advantage that it can call upon higher processes
to comment on the consistency of its visual observations.  Only rarely is our
binocular sight confused by ambiguities, and then this can usually be removed
with a tilt of the head or slight motion to the side for a different perspective
and more information (an observation which led Moravec [\moravec] to his
development of {\it slider stereo}). An interpretation mechanism is at work with
which our stereo systems at present have little to compare.  Some
researchers have decided that local averaging of depth measurements
provides a reasonable approximation to error correcting, hoping to
diminish the impact of gross errors through the abundance of good correspondences
(for example [\levine] and [\grimson]).
A superior approach is to work within a set of plausible assumptions on
the nature of the viewed world, and use the implications of these
assumptions to choose among ambiguous interpretations.  A common assumption
is, for example, that the world is smooth and continuous almost everywhere,
and can be expected to be discontinuous only at those places where the
viewed luminance is undergoing abrupt change -- that is, to presume that
significant changes in scene brightness may separate areas of different
depths, and where there are not such brightness changes, the surface is
likely to be smooth and continuous.

The way such knowledge enters the analysis varies.  In some work, the
continuity assumption is used in prediction.  Levine, Panton, and Gennery,
in their area-based systems, use the context of neighboring points to
limit the search for point correspondents, presuming that points
neighboring in two dimensions should be neighboring in three dimensions.
But this has problems of consistency -- the results would change were the
analysis to be done right to left rather than left to right, and decisions
are made locally, in a set direction, never to be revised.  The MIT
Grimson work makes good use of inference on the continuity of surfaces and
the lack of edge signal in its interpolated surface fitting (see the
summary). However its use of context in its local edge correlation is only
cursory, in that matching at a lower resolution (lower spatial frequency)
is a prerequisite for matching at a level of finer detail (higher spatial
frequency).  A global metric is used in consistency checking over regions
--- requiring $70\%$ of the disparities to be in agreement (one standard
deviation, presumably), but this has been implemented without adequate
analysis (see [\grimson] page 213, where it appears to produce a
highly quantized, planar effect). [\schumer] discusses a possible
mechanism in the human system for this spatial averaging of disparities.
The Baker algorithm uses a similar global continuity assumption arising
from two-dimensional connectivity to resolve disparity ambiguities along
connected contours --- the assumption being that projected connectivity in
two-dimensions is a good indicator of connectivity in three-dimensions,
and abrupt changes in disparity along a path connected in 2-D suggests a
correspondence error.
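A sketch of the sort of regional disparity-consensus test described above follows; the 70 percent figure is taken from the text, while the use of the region median and the agreement tolerance are assumptions of the example.

```python
import numpy as np

def region_disparity_consensus(disparities, fraction=0.70, tolerance=1.0):
    """Check whether a region's disparity estimates agree, in the spirit of
    the consistency test attributed to Grimson above: at least `fraction`
    of the values must lie within `tolerance` of the region's median
    disparity.  Median and tolerance are illustrative choices."""
    d = np.asarray(disparities, dtype=float)
    d = d[~np.isnan(d)]                 # ignore unmatched points
    if d.size == 0:
        return False
    agree = np.abs(d - np.median(d)) <= tolerance
    return agree.mean() >= fraction
```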

\sheadsection{Identifying depth discontinuities}
An issue related to the use of global continuity assumptions is the
identification of depth discontinuities in the scene -- those places where
the viewed surface is not smooth and continuous.  This capacity has not
been reliably incorporated into area-based analyses, where poor matches
arising from occlusions or extreme perspective effects merely return a low
correlation value, indistinguishable from other causes of poor matches. In
cases of occlusions, the intensity values in a window about the depth
discontinuity in the two views would have little likelihood of
corresponding, and the correlation coefficient as a measure of similarity
is inappropriate here.  Edge-based analyses operate with the artifacts of
(among other things) depth discontinuities, and the inference is available here
for distinguishing occlusions and abrupt changes in depth.
In [\bakerthesis], edges found by a zero-crossing
operator are split into their left and right components, and these
`half-edges' are matched across images.  Contrast variations (even
reversals) needn't affect the correlation.  Surface matching (based
primarily on intensities) is performed within intervals defined by pairs
of such corresponding half-edges -- occluded regions will not have pairs
of corresponding half-edges (Baker points out in [\bakerthesis] that the
algorithm does not use two-dimensional global information in its pixel
intensity correlations -- smoothness between space contours is only
enforced along the baseline axis -- and this should be altered in a better
solution).
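To make the half-edge idea concrete, here is a hypothetical sketch of a half-edge record and a deliberately permissive pairing test along a scanline; the attribute set, the one-sided disparity limit, and the decision not to compare intensities (since contrast may vary or even reverse between views) are assumptions for illustration, not Baker's actual representation.

```python
from dataclasses import dataclass

@dataclass
class HalfEdge:
    """One side of a zero-crossing edge on a scanline: its column, which
    side of the edge it describes, and the flanking intensity.
    A hypothetical structure for illustration only."""
    column: int
    side: str          # 'left' or 'right'
    intensity: float   # mean intensity just to that side of the edge

def candidate_pairs(left_edges, right_edges, max_disparity=40):
    """Pair half-edges of the same side across the two scanlines whenever
    their columns differ by an allowed (non-negative) disparity.  Intensity
    is deliberately not compared, so contrast reversals do not matter."""
    return [(a, b) for a in left_edges for b in right_edges
            if a.side == b.side and 0 <= a.column - b.column <= max_disparity]
```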

In an early interesting variant on area-based correspondence, [\levine] used a technique
of mixing area-based correspondence with feature-based guidance. Image
points having locally maximal directional intensity variance were selected
and correlated first. These they referred to as `tie-points'.
The disparity values for these tie-points then
limited the search range for neighbouring image points, and certain
heuristics were introduced to infer and compensate for occluded regions.
The tie-points also provide a disparity context for the matching of nearby
image elements. [\mori] alludes to
a similar technique, saying that they process first those parts of the images
having high intensity contrast, and use the disparity determined there, with an
assumption of scene continuity, to constrain the disparity of neighbouring
pixels.
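A minimal sketch of the tie-point selection idea described above, under an assumed window size and variance threshold: columns whose local intensity variance along the row is a local maximum are kept as candidate tie-points.

```python
import numpy as np

def select_tie_points(image_row, window=11, min_variance=100.0):
    """Pick candidate 'tie-point' columns on one image row: positions whose
    local intensity variance (over a 1-D window along the row) is a local
    maximum and exceeds a threshold.  The window size and threshold are
    illustrative, not the published values."""
    row = np.asarray(image_row, dtype=float)
    half = window // 2
    variance = np.array([row[c - half:c + half + 1].var()
                         for c in range(half, len(row) - half)])
    ties = []
    for i in range(1, len(variance) - 1):
        local_max = variance[i] >= variance[i - 1] and variance[i] > variance[i + 1]
        if local_max and variance[i] >= min_variance:
            ties.append(i + half)       # convert back to a column index of the row
    return ties
```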

\sheadsection{Parallelism possible}
A stereo system to be used for tasks of navigation or process control must be
judged on its ability to provide depth measurements
at rates approaching real-time.  The enormous amount of computation inherent
in the analyses makes it unlikely that a scheme with intrinsic ordered
dependence in its processing will be able to provide adequate speed.  The
potential for parallelism in the algorithm is an important consideration.
Neither the Gennery nor the Panton approaches could take full advantage of
the high parallelism possible in the computation since they process
from left to right in columns across the match image, relying upon previous
correspondences to constrain the search for matches.  The Henderson and Levine approaches
are similarly limited, in that they process by lines from image bottom to top, with
each line progression passing up the results of the preceding line
analyses to constrain the search. The Grimson and Baker algorithms are both
amenable to parallel implementation.

\sheadsection{A Composite Solution}

{\it The principal problem with the {\it area-based} cross-correlation approach
lies in the difficulty of shaping image patches for the matching across images.}
Projective distortions and occluding surfaces exacerbate the problem of choosing
frames for the comparison of conjugate areas.  This window shaping difficulty makes
area correspondence fail in areas of varying relief and it fails wherever there
is occlusion (i.e.\ at depth discontinuities).

{\it {\it Edge-based} correspondence encounters difficulty when edge density or lack
of local distinguishing edge characteristics make for ambiguous matching.}  More
global characteristics (either in the projective images, or over time) could
resolve the correspondence.  The globality here could come from connectivity
over extended edges (as used in [\bakerthesis], [\pantongrosch], and
[\arnoldthesis]) or contextual cues as discussed in [\binfordaij] ---
these utilize monocular cues to stereopsis. It could also come
from stereo observation over a series of views  --- either spatially, as in
[\moravec] using 9 views of the scene,
or temporally, with a focus on ambiguous areas and a feedback loop to adjudicate
the rivalry.

{\it The matching of extended edges will also not be without ambiguity.}
Occlusions will disrupt projective continuity, and the matching of these
extended edges need not be one-to-one.  The evidence of local edge
correspondences could help in the resolution of these conflicts.

What should be seen is that there will be ambiguities -- they can never be
guaranteed to be resolved within a certain {\it spatial} frame, or within a certain
{\it temporal} frame.

\vspsml
\bullit{{\it local edge} matching and {\it extended edge} matching can help each other
in a pair of views;}

\vspsml
\bullit{observations over {\it time} or over {\it multiple views} can assist both;}

\vspsml
\bullit{access to a {\it model memory} can be used to guide, or disambiguate surface
interpretation;}

\vspsml
\bullit{{\it edge} correspondence provides the surface boundary and surface orientation context
that eludes and invalidates area-based cross-correlation approaches;}

\vspsml
\bullit{{\it area-based}
correspondence, where it is applicable, away from surface boundaries, specifies
the shape of the surface $\ldots$ edge correspondence only specifies the position of
surface boundaries or interior detail.}

\vspsml
A stereo mapping system expected to process
images of both cultural and rural scenes with precision and reliability will need
the benefits of each of these correspondence approaches:
\vspsml
\bullit{edges (binocular cues),}
\bullit{extended edges and edge contexts (monocular cues),}
\bullit{integration over time (resolution of ambiguity through time),}
\bullit{integration over multiple images (resolution of ambiguity through redundancy),}
\bullit{area-based cross-correlation (fuller surface description).}

\newpage
\parindent 0vu
\headsection{Commercial System Summaries}

\sheadsection{Bunker-Ramo UNAMACE}

\sayref{``Automatic Map Compilation,'' Bertram, S.,
Photogrammetric Engineering, p.\ 184, January 1963.}

  A description is presented of a prototype model of an optical-electronic
  automatic stereomapping system that can be added to conventional plotters
  to produce altitude information and orthophotos automatically from pairs
  of aerial photographs.  A pair of light sources is projected through
  diapositive stereo pairs so as to cross in image space at a 0.3$↑{\prime\prime}$x0.1$↑{\prime\prime}$
  window mask and thence to a pair of photomultipliers.  A Nipkow disk, the
  original TV scanner, spins over the window, effectively scanning lines
  from the window.  The window-disk subsystem is controlled to move as a
  unit through image space along the surface of the stereo model.  The
  scanning disk tilts to conform roughly with the tilt of the terrain.
  Control of the window to follow the surface of the model is accomplished
  as follows.  The output of each of the photomultipliers passes through a
  delay line.  The delayed signal from multiplier 1 is correlated with the
  undelayed signal from multiplier 2, and vice versa.  The outputs of the
  pair of correlators are input to a differencing network.  When the window
  is locked onto the surface of the stereo model, the output of the
  differencing network is zero.  If it is off the surface of the model, the
  sign of the network output indicates the direction to move to return to
  the surface.  The device amounts to a phase sensitive detector.
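The error signal of this phase-sensitive detector can be sketched digitally as follows; the correlation here is a simple dot product over sampled scan signals, and the delay length is an assumed value rather than the instrument's.

```python
import numpy as np

def height_error_signal(signal_1, signal_2, delay=3):
    """Digital sketch of the phase-sensitive detection described above:
    correlate a delayed copy of each photomultiplier signal against the
    undelayed other signal and difference the two correlations.  The result
    is near zero when the scanning window sits on the model surface, and its
    sign indicates which way to move.  The delay length is illustrative."""
    s1 = np.asarray(signal_1, dtype=float)
    s2 = np.asarray(signal_2, dtype=float)
    n = min(len(s1), len(s2)) - delay
    corr_1_delayed = float(np.dot(s1[:n], s2[delay:delay + n]))  # delayed 1 vs undelayed 2
    corr_2_delayed = float(np.dot(s2[:n], s1[delay:delay + n]))  # delayed 2 vs undelayed 1
    return corr_1_delayed - corr_2_delayed
```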

  It is reported that ``rugged terrain could be handled quite readily if a
  very small signal were used''.  Besides the difficulties to be expected
  over water, cliffs, and relative obscurations, the system exhibited
  limitations associated with the direct projection system and the Nipkow
  disc.  Vignetting effects limited the signal in the corners of the model
  and at large angles of model tilt.


\sayref{``The Automated Map Compilation System,'' Bertram, S.,
 Photogrammetric Engineering, p.\ 678, July 1963.}

  An experimental automatic map compilation system is described.  Scanning
  of the stereo diapositives is accomplished with flying spot scanners and
  movable lenses.  An operator is able to monitor system performance through
  an electronic stereo viewer involving a twin TV system.  The instrument
  scans in a profiling mode, with a film carriage moving in a
  direction nearly perpendicular to the flight line, and sampling in 0.01$↑{\prime\prime}$
  steps.  Profile lines are spaced 0.02$↑{\prime\prime}$ apart.  The operator can intervene
  if the system is observed to drift off the model surface, or if alerted by
  a ``lack of correlation light''.  As the diapositive carriage moves smoothly
  during the scan, the electronics incorporates ``stop motion'' circuitry
  analogous to that of an image motion compensation system of a camera.  The
  sampling window is 0.050$↑{\prime\prime}$x0.050$↑{\prime\prime}$.  Whereas the system incorporates a
  square scan, plans are offered for converting to a parallelogram of a
  shape dictated by the slope of the local terrain.  The sensitivity
  of the instrument with ``good imagery'' is reported to be comparable with
  that of a human operator.  Steep cliffs, differentially hidden areas, 
  and areas of low detail cannot, of course, be automatically processed.



\sayref{``The Universal Automated Map Compilation Equipment,'' Bertram, S.,
 vol.\ 15, part 4, International Archives of Photogrammetry, 1965;
 Photogrammetric Engineering, p.\ 244, 1965.}

  This paper was written at the time that the Universal Automated Map
  Compilation Equipment (UNAMACE) was nearing completion at Bunker-Ramo.
  The UNAMACE was based on the principles demonstrated in the Automated Map
  Compilation System, described in the earlier papers by Bertram.
  It is suggested as advisable, in order to address ambiguous correlation
  peaks that can arise from repeated structures such as orchards, to use
  ``dual correlators, with a high acuity channel taking advantage of the
  available resolution in the photographs, [and a] low-acuity channel
  $\ldots$ operating on low-resolution imagery in the photographs''.
  The intent is that ``the high-acuity channel provide tight tracking most
  of the time, with the low-acuity channel helping over difficult areas
  (such as large altitude changes)''.

  Correlator circuit bandwidth is cut off at the low-frequency end to suppress
  general highlights, and at the high end near the intended frequency resolution
  of the correlation.  The UNAMACE incorporates a high-acuity and a low-acuity
  height-error sensor, the outputs of which are ``appropriately summed'' to
  provide a net height-error signal.

  Transformation from object space to film space accounts for camera model,
  lens and film distortion, earth curvature and atmospheric refraction.
  Each stereo diapositive is mounted on a separate scanning table with its
  own flying spot scanner.  Scanning operations are controlled by a
  Bunker-Ramo Model 133 computer.  The system could process highly
  convergent photography, and distortions of panoramic cameras.  The tables
  could accommodate 9$↑{\prime\prime}$x18$↑{\prime\prime}$ glass plates, and
  translate up to 2 inches per second at
  4-micron rms accuracy.  Correlation tracking incorporates extrapolation
  from previously processed points.  The high-speed scan direction (X) is
  parallel to the line joining the two camera stations.  A TV-like coverage
  is input to the correlator at each sample location, with the relatively
  slower scan in the Y-direction.  Correlators processing the Y-component of
  the signal are used to remove Y-parallax errors from the data.

  A stereo-viewer on the control console enables monitoring of the automatic
  compilation.  The viewer incorporates two 4 inch, square-faced CRTs.  Stereo
  viewing is accomplished via crossed Polaroids and image merging through a
  beam splitter.  A stereo floating mark is generated electronically.

  Altitude determinations are made at a rate of 100 points per second,
  representing a three-fold improvement over the earlier system.  At the
  time of Bertram's writing of this paper, the effectiveness of the UNAMACE
  operational cycle remained to be verified.  It was anticipated that
  significant improvements would result from programming modifications that
  would permit the density of measurements to adapt to the terrain.  It was
  estimated that the UNAMACE would compile 100\%\ lap diapositives in about 3
  hours using 250-micron spacing along the profile and 500-micron spacing
  between profiles.

  The UNAMACE produces orthophoto products as well as elevation contour
  information.  It was recognized that it would be desirable to generate
  contour lines rather than the rotary sequence of three gray tones of the
  system described. 

\sayref{``The UNAMACE and the Automatic Photomapper,'' Bertram, S.,
 Photogrammetric Engineering,
 vol.\ 35, no.\ 6, p.\ 569-576 June 1969.}

  The UNAMACE has been found to operate at up to 80 points per second.
  Diapositive table control is repeatable to 2-micron, with an absolute
  accuracy of 4-micron.

  A small ruggedized van-mounted version of the UNAMACE called the Automatic
  Photomapper has been developed.  It is restricted to 9$↑{\prime\prime}$x9$↑{\prime\prime}$ photographs.


\newpage

\sheadsection{Bendix AS-11B-X}

\sayref{``A Digital Elevation Data Collection System,'' Scarano, Frank A.,
 Photogrammetric Engineering and Remote Sensing,
 vol.\ 42, no.\ 4, p.\ 489, April 1976.}

  The AS-11B-X automated stereomapper experimental system was developed by
  Bendix Corporation for the Rome Air Development Center specifically to
  generate elevation data in digital form.  It employs laser scanning along
  epipolar lines, analog-to-digital conversion of photo data, digital
  storage, and digital correlation.  The elevation points are interpolated
  off-line to a fixed grid.  The instrument performs single-pass scanning,
  interim random access digital storage of density values, and subsequent
  recall for correlation calculations.  Correlation window shaping is
  accomplished by address modification of digitally stored imagery data.  In
  the case of perfectly registered images, the utilization of the epipolar
  scan pattern enables correlation matching to be accomplished along
  corresponding epipolar lines.  It is reported that the variance of
  residual parallax in the direction transverse to the epipolar lines is
  always small, implying that the effect of y-parallax variation can be
  neglected.

  Image scanning is accomplished by optical-mechanical straight-line
  deflection of the laser beams over the images.  It is thus implied that
  scanning of nonlinear epipolar lines is accomplished by linear
  approximation.  The image density is sampled at 1280 points spaced at 20
  micron intervals over a 25.6 mm deflection of the laser beam.  The sampled
  points are partitioned into 58 segments each of which is associated with a
  separately developed profile running in a direction transverse to the
  laser sweep.  Successive sweeps of the laser beams are displaced 50
  microns apart by mechanically stepping the film stages.

  Correlation patch shaping and digital correlation are accomplished by
  special-purpose digital logic.  Digitized imagery values are discarded as
  soon as each local area is correlated.  It is stated that normalized
  correlation is used both for evaluation of system performance and
  determination of system strategies, but that image-correspondence
  determination is based on faster local covariance maximization.  Regions
  for which the correlation value falls below an established threshold
  are filled in by a human operator.

\newpage
\sheadsection{GESTALT GPM Photomapping Systems}

\sayref{``The Gestalt Photomapping System,'' Kelly, R.\ E., P.\ R.\ H.\ McConnell, and S.\ J.\  Mildenberger,
 J. of Photogrammetric Engineering and Remote Sensing,
 vol.\ 43, p.\ 1407, 1977.}

  The Gestalt Photomapping System is a computer-controlled, autocorrelating,
  analytical photomapper, combined with an off-line plotting unit.  The
  photomapper is composed of a pair of scanners, an automatic image
  correlator, a control computer, an operator's console, and one or two
  printers.  Each scanning unit is comprised of a flying spot scanner,
  reducing optics that focus the spot onto the input diapositive mounted on
  a transporter stage, and a photomultiplier recorder.  The input
  diapositives are processed in 0.72 sq. cm. patches.  Auto-correlation of
  an array of 2444 points (47 x 52 matrix) spanning each patch is
  accomplished by analyzing and digitizing the photomultiplier output,
  measuring parallax for each point, storing and smoothing the parallax
  data, calculating and implementing changes in the video scanning pattern
  to minimize parallax values, reshaping the scanning patterns accordingly,
  and then repeating the entire processing sequence until the parallax
  values of the individual points stabilize.  This process, referred to as
  complete differential rectification, occurs over the square centimeter
  patch instead of at a floating mark, and does so in less than a second.
  The system produces orthophoto hard copy products, as well as magnetic
  tape output.  Contour or profile lines may be optionally overlayed on the
  orthophotos.  The system contour sheets differ from conventional map
  manuscripts in the important respect that they contain only topographic
  information and not planimetric details, such as roads, buildings, et cetera.
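The iterate-until-stable rectification loop described above can be sketched as follows; `measure_parallax` and `warp_scan` are hypothetical stand-ins for the instrument's correlator and scan-shaping stages, and the smoothing, iteration limit, and tolerance are assumed values.

```python
import numpy as np

def rectify_patch(measure_parallax, warp_scan, max_iterations=10, tolerance=0.25):
    """Skeleton of the iterative differential rectification described above:
    measure the residual parallax at each grid point, smooth the values, use
    them to reshape the scanning pattern, and repeat until the parallaxes
    stabilize.  The callables and numeric limits are illustrative."""
    for _ in range(max_iterations):
        parallax = np.asarray(measure_parallax(), dtype=float)  # one value per grid point
        if np.abs(parallax).max() < tolerance:                  # values have stabilized
            return True
        smoothed = np.convolve(parallax, np.ones(5) / 5.0, mode="same")
        warp_scan(smoothed)                                     # reshape the scanning pattern
    return False
```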

  The Gestalt Photomapping System represents a high level in the
  state-of-the-art in automated mapping equipment.  The instrument is
  designed to operate upon hard copy input, though could presumably be
  redesigned to operate from a digital data base.  The system would appear to
  be well suited to typical natural terrain analysis, but must be presumed
  to encounter the same problems experienced by all related area
  cross-correlation systems when confronted with steep and overhanging
  relief, such as that frequently encountered in dealing with common
  cultural artifacts.



\sayref{``DTM Application of Topographic Mapping,'' Allam, M.\ M.,
 Photogrammetric Engineering and Remote Sensing,
 vol.\ 44, no.\ 12, p.\ 1513-1520, Dec.\ 1978.}

  A description is presented of an application of the Gestalt Photomapper
  GPM-2/3 to DTM (Digital Terrain Mapping) by the Topographical Survey
  Division, Canadian Department of Energy, Mines and Resources.  The GPM-2/3 is
  described as a highly automated stereophotogrammetry system designed to
  produce DTMs, orthophotos, and photographic contours by using electronic
  image correlation to measure parallaxes.  The instrument incorporates
  flying spot scanners and a digital correlator module.  The correlator
  iterates over all 2444 points in a 182-micron patch 50 times per second.
  The height determinations are used to differentially transform each
  scanner's raster until the stereo correspondence processing stabilizes
  over the patch.

\newpage

\sheadsection{Stanford Viking Mapper}
\ljust{(a computer-based, non-commercial stereo photomapping system)}

\sayref{``Viking 1975 Mars Lander Interactive Computerized Video Stereophotogrammetry,''
Liebes, Jr., S.\ and A.\ A.\ Schwartz,
Journal of Geophysical Research, 82, no.\ 28, p.\ 4421, Sept.\ 30, 1977.}

A computer aided stereophotogrammetry system was developed by Liebes and
Schwartz in support of the Mars Viking 1975 lander camera system.
Although the stereo correlation was performed by a human
operator, the system exhibited a versatile stereophotogrammetry
capability.  As the system was developed outside the mainstream of the
photogrammetric community, it is not widely known.  Digital stereo images
radioed back to Earth from Mars were stereoscopically viewed on a pair of
high resolution video monitors of a Stereo Station.  Visual stereo model
formation was complicated by the combination of spherical coordinate
scanning geometry of the lander cameras, differential stereo illumination
conditions, image acquisition resolutions that could differ by a factor of
three, and near field, high-convergence stereophotogrammetry requirements.
Stereo model formation was facilitated by independent orthogonal image
size, brightness and contrast controls, and image rotation via motorized
rotation of the monitor deflecting yokes.  A graphics 3-D point mark was
overlaid into the imagery prior to routing of the visual data to the video
monitors.  Thus, analogue image warping at the monitors had no effect on
the relationship of the 3-D mark to the imagery.  The 3-D mark could be
moved through the perceived Martian relief by hand input control.
Individual points could be ranged, and profiles of arbitrary character
could be generated by selection of the family of surfaces of profile
constraint.  Parameters relating hand input control position to 3-space
mark location could be varied to suit operational conditions.  The system
had real-time editing facilities, could optionally display the profile
graphics as stereo image overlays, stereoscopically reproject the profiles
from arbitrary viewing directions, and output the topographic data in a
variety of formats and scales.

\newpage
\parindent .7vu
\sheadsection{Published Reviews}

[\friedman, p.\ 699-722] presents a review of automation of the photogrammetric
process.  The review covers mensuration, rectification and
orthorectification, elevation determination, planimetry delineation, and
integrated systems, as well as a lengthy section on the history,
principles, and current designs, and also a section devoted to
methods of automatic image correlation.  He reports [Ibid., 701] that a
thorough description of on-line automated elevation determination systems
is presented in [\dowman].  We have not, however, been able to secure a
copy of this report.  Friedman also reports that the entire November 1978
issue of Photogrammetric Engineering and Remote Sensing is devoted to
automated analytical stereoplotters; however, we find this reference to be
erroneous.  Perhaps the intended pointer was to the December issue, which
contains discussion of the research work of CDC [\panton], and
applications of the Gestalt Photomapper [\allam], and a Galileo Santoni
2C Orthophoto System manufactured by the Galileo Corporation of America.  No
literature references are given to the Galileo system.  Automatic
orthorectification is reported to be achieved by the Wild B-8 Stereomat,
and the Jena Topomat, as well as by the UNAMACE and the Gestalt GPM-2.
Friedman [\friedman, p.\ 701, 719] also reports that Hobrough
[\hobrough] has proposed another digital on-line correlation concept,
called RASTER.  The particular novelty appears to be in the use of a solid
state linear array scanner of 1728 elements, rotated by servomotors to
lie substantially along epipolar lines.  Friedman remarks [\friedman, p.\ 701]
that the future of elevation determination and
rectification may well lie with off-line digital image processing.
Off-line processing is exemplified by the research work of Panton et
al.\ at CDC, and, though not noted in Friedman's review, by the work of
Arnold, Baker, and Binford at Stanford University, Grimson and Marr at
MIT, and others, as described elsewhere in this survey.

U. V. Helava [\friedman, p.\ 703] is credited, while working at the National
Research Council of Canada, with the invention of the analytical plotter
[\helava], used for non-automated, but computer assisted correspondence
computations.
The distinction relative to what he refers to as the
``simulation principle'' is that in simulation systems real physical
projection is employed.  Helava cited as disadvantages of the simulation
approach the inflexible, bulky, extremely high precision instruments
required.  In the analytical plotter, computers and special
servo-mechanisms move the photographic plates in a manner that compensates
for central projection, and other factors.  The first analytical plotter
AP-1 was built by the NRC.  The bulk of the analytical plotters on the
market today do not incorporate automated correlation.

Friedman, in addition to discussing the principles of the analytical
plotter [\friedman, p.\ 704], describes eight current designs
[Ibid., 709] that were presented at the Analytical Stereoplotter Workshop
and Symposium in Reston, VA, in April of 1980.  None of these systems
employ automated correlation.  These instruments and their manufacturers
are as follows:

\vspsml

\def\sldevice#1#2{\vbox{\hbox{#1}
		      \hbox{\hskip .5vu \vbox{\hbox{#2}}}}
		\vspsml}
\sldevice{Analytical Photogrammetric Processing System IV (APPS-IV)}{Manufacturer:
Autometric, Inc, Falls Church, VA.}

\sldevice{Analytical Stereoplotter Model AP/C4}{Manufacturer:
Ottico Meccanica Italiana S.p.A. (OMI) of Rome, Italy.}

\sldevice{Autoplot}{Manufacturer:
Systemhouse, Ltd., Ottawa, Canada.}

\sldevice{C100 Panicomp Analytical Plotter}{Manufacturer:
Zeiss, Oberkochen, Germany.}

\sldevice{DSC-3/80 Analytical Stereoplotter}{Manufacturer:
Keuffel & Esser Co., San Antonio, Texas.}

\sldevice{Galileo D.S. Digital Stereocartograph}{Manufacturer:
Officine Galileo S.p.A. of Florence, Italy.}

\sldevice{Traster Analytical Stereoplotter}{Manufacturer:
Matra Optique, Rueil-Malmaison, France.}

\sldevice{US-2 Analytical Stereoplotter}{Manufacturer:
Helava Associates, Inc., Southfield, MI.}

Additional analytical stereoplotters that are not well publicized, because
of their highly specialized character, but which are stated to ``represent
large investments, produce tremendous amounts of photogrammetric data and
are examples of the latest technology'' are also cited [Ibid., 715]:

\vspsml
\vbox{\hbox{AS11/A1}
	\hbox{\hskip .5vu
	      \vbox{\hbox{Ottico Meccanica Italiana (OMI), Rome, Italy}
		\hbox{Control computer manufacturer: Modular Computer Corp.}
		\hbox{Interface manufacturer: O.M.I. Corporation of America}}}}
\vspsml
\vbox{\hbox{AS11/A1(NOS)}
	\hbox{\hskip .5vu
	      \vbox{\hbox{Interface: PDP 11/45}
		 \hbox{Manufacturer: OMI}}}}
\vspsml
\vbox{\hbox{TA3/P1 \& TA3/P11}
	\hbox{\hskip .5vu
	     \vbox{\hbox{Control computers: PDP 11/60}
		   \hbox{Interfaces: Bendix Research Laboratories}
		   \hbox{Principal users: U.S.D.M.A., U.S.G.S.}
		   \hbox{Manufacturer: OMI}}}}

Another analytical plotter, referred to only in passing [Ibid., p.\ 701], is
the Anaplot.  The entire June 1979 issue (pp.\ 89-192) of the Canadian
Surveyor is devoted to this instrument, which Blachut (p.\ 89) refers to as
``in many ways, $\ldots$ the most advanced system of its kind in operation.''
The instrument does not incorporate automated stereo correlation.  A quick
scan of the volume suggests that this prototype system, constructed by the
National Research Council of Canada, is noteworthy for the quality of
the design and its implementation rather than for any fundamental novelty.
Jaksic remarks (p.\ 95) that they have sought ``high universality,
modularity, accuracy, speed, reliability, stability, flexibility and
convenience in operation'', a goal which is reported to have been achieved
by ``the optimal design and choice of components and the proper integration
of the system's hardware and software''.  According to Blachut (p.\ 93) the
instrument was to have been constructed by Instronics Ltd., of
Stittsville, near Ottawa, under contract to, and in close collaboration
with, the National Research Council of Canada.  Instronics, however, was bought by
Gestalt International Ltd., and Gestalt declined to build the system.
The Canadian Marconi Company (CMC) of Montreal secured the license for
construction, and as of the time of publication of the cited reference was
proceeding with manufacture.

\newpage
\parindent 0vu
\headsection{Research System Summaries}

\sheadsection{Area-based correspondence methods}

\ssheadsection{Gimel'farb, Marchenko and Rybak System 1972}
\sayref{``An Algorithm for Automatic Identification of Identical Sections on Stereopair Photographs,''
G.\ L.\ Gimel'farb, V.\ B.\ Marchenko, and V.\ I.\ Rybak,
Kybernetica (translations) No.2, p.\ 311-322, March-April 1972.}

	The authors of this paper were the first to use dynamic programming in a
	stereo correspondence process.  The algorithm they describe processes
	image pairs on a
	line-by-line basis, exploiting epipolar geometry constraints and
	using known (a priori)
	disparity and surface slope limits to constrain the correspondence search.
	It optimizes a cost function based on normalized cross-correlation; the
	convolution used incorporates a lateral inhibition computation.  The correspondence
	algorithm is described analytically as
	finding the function mapping intensities from one image line to the other.
	Testing was done on short, wide images (i.e.\ $5\times500$ pixels).
	The authors suggest that one could improve the speed of such stereo
	processing in two ways: first, by using the results of prior line analyses
	to guide the
	matching and bound the search on subsequent lines, and second, by
	partitioning lines into smaller stretches, reducing the combinatorics of
	the correspondence matching.
	The first is a technique that CDC used in their
	stereo work (as will be discussed).
	The second can be seen as a preview of the multiple resolution
	correspondence processes of Baker, since rough alignment of corresponding
	parts of the two lines must be made before they are broken into smaller
	stretches.
	Depiction of the results obtained with the algorithm is a bit sketchy,
	as the plots shown are of single line analysis only.  The report
	says that results from
	this totally automated process are comparable to those of
	human operators using automated photomapping devices.
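
For reference, the normalized cross-correlation of two windows of intensities
$L_k$ and $R_k$ (written here in our own notation, not that of the paper) is
$$\rho\;=\;{\sum_k (L_k-\bar L)(R_k-\bar R)\over
\sqrt{\sum_k (L_k-\bar L)^2\,\sum_k (R_k-\bar R)^2}}\;,$$
where $\bar L$ and $\bar R$ are the window means; the dynamic programming
search selects the line-to-line mapping that maximizes an accumulated
function of such scores.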

\ssheadsection{Levine and O'Handley System 1973}
\sayref{``Computer Determination of Depth Maps,'' Martin D.\ Levine, Douglas A.\ O'Handley, Gary M.\ Yagi,
Computer  Graphics and
Image Processing, 2, p.\ 131-150, 1973.}

    This system described by Levine and O'Handley was intended to
    provide depth information for the Mars rover vehicle's autonomous
    navigation.  Tests of its performance were carried out on stereo imagery
    collected in the vicinity of the Jet Propulsion Laboratory.  Because of
    the system's intended use, it was possible to work with the basic premise
    that the scene viewed is approximately planar, running off to a horizon
    somewhere in the distance (not necessarily in the images).  It uses
    collinear epipolar imaging for its two cameras to limit correspondence
    search. Matching is by intensity cross-correlation, with an adaptive
    window size set by the variance at pixel $(i,j)$ in the image --- a large
    variance sets a small window size, and vice versa. Processing was
    organized to run in lines from bottom to top.  Search constraints on
    possible disparities are exploited throughout the analysis.  First the top
    and bottom lines are correlated to estimate the overall disparity ranges
    (notice that this presupposes that scene depth varies regularly from top
    to bottom, as in a view toward the horizon).  Then a prepass analysis is
    applied to a sampling of $n$ lines ($n=5$) to set local maximum and
    minimum disparity ranges.  Correlation along a line pair is over windows
    with locally maximal variance, called `tie-points'.  The local maxima are
    used to iteratively segment the reference line.  A coarse search using
    statistical parameters (variance) of image windows is used to find good
    candidates for the more expensive computation of the correlation
    coefficients. The candidate pairings chosen through this process are then
    evaluated to select the optimal matches and to refine their positions in
    three-space.  The coarse search is done with every other pixel along a
    line.  Cross-correlation is only done with windows of similar variance.
    The system uses the epipolar geometry constraint in a way that prohibits
    positional reversals along a line.  The authors indicate in the paper that
    they are aware of the difficulties introduced by occlusions, and mention
    an ad hoc scheme for preventing parts of the images felt to be occluded
    from being matched, but the technique is not further described.
    Two-dimensional proximity is also used to limit disparity possibilities;
    an allowable range is set at each tie-point by examining neighbouring
    disparity values on the preceding line (actually the current line minus 4
    --- i.e.\ they process every fourth row and every second element).  Final
    disparity values are smoothed, and deviants are removed.
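
A minimal sketch (in Python, our own construction; the window sizes, variance
thresholds, and the top-$k$ simplification of `locally maximal variance' are
illustrative assumptions, not values from the paper) of the variance-driven
window sizing and tie-point sampling described above:

import numpy as np

def local_variance(image, row, col, half):
    """Intensity variance of the square window of half-width `half` at (row, col)."""
    patch = image[row - half:row + half + 1, col - half:col + half + 1]
    return float(patch.var())

def choose_window_size(variance):
    """Large variance -> small correlation window, and vice versa.
    The breakpoints are purely illustrative."""
    if variance > 800.0:
        return 5
    if variance > 200.0:
        return 9
    if variance > 50.0:
        return 13
    return 21

def tie_points(image, row, half=10, step=2, top_k=20):
    """Sample every `step`-th pixel of a line and keep the highest-variance
    positions as candidate tie-points (a simplification of true local maxima)."""
    cols = range(half, image.shape[1] - half, step)
    scored = sorted(((local_variance(image, row, c, half), c) for c in cols), reverse=True)
    return sorted(c for _, c in scored[:top_k])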

\ssheadsection{Mori, Kidode and Asada System 1973}
\sayref{``An Iterative Prediction and Correction Method for Automatic Stereocomparison,''
Ken-Ichi Mori, Masatsugu Kidode, Haruo Asada,
Computer Graphics and
Image Processing, 2, p.\ 393-401, 1973.}

	Mori, Kidode and Asada, in this short paper that raises as many questions as
	it answers, describe an interesting stereo mapping system.
	Epipolar geometry is used to constrain the search for correspondences
	in the area-based correlation they use.
	The system is demonstrated on a model pattern and a pair of aerial
	photographs, although only a single line of results is presented.
	A gaussian weighted correlation function is used to diminish the
	effect of peripheral intensity variations.  Window size is modified
	by the range of disparity expected for the point, and they suggest
	that this should be set by first correlating over a large window,
	then narrowing to a smaller window when the gross disparities are
	known (the paper doesn't explain this resolution reduction process any
	further). An assumption of scene continuity is also used in limiting
	correspondence search. The technique is iterative:
	the right image is repeatedly distorted and compared with the left image
	until no substantial intensity differences are found.
	The abstract says
	that the first matching is done on highly contrasting parts of the
	images (`roads, coast, forest edges'), and the context of this is used,
	with the smoothness assumption, to expand the correspondences into neighboring
	parts of the scene, but the body of the paper does not elaborate on this.
	The paper is very brief and cursory, suggesting much more than it reveals
	$\ldots$ it would be very interesting to see whatever further documentation
	they have on this system. Examples are incomplete and inconclusive.
	No follow-up to this work has appeared.
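
One plausible form of such a gaussian-weighted measure (our notation and
normalization; the paper's exact weighting is not reproduced here) is
$$C(d)\;=\;\sum_k w_k\,L(x_k)\,R(x_k+d),\qquad
w_k\;=\;\exp\!\left(-\,{(x_k-x_0)^2\over 2\sigma^2}\right),$$
so that the contribution of intensities near the window centre $x_0$
dominates and peripheral intensity variations are discounted.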

\newpage
\ssheadsection{Hannah System 1974}
\sayref{``Computer Matching of Areas in Stereo Images,'' Hannah, Marsha Jo,
Ph.D.\ thesis,
Stanford Artificial Intelligence Laboratory, AIM-239, July 1974.}

	This thesis describes a series of techniques developed for increasing
	the efficiency of area-based correlators. It contains a
	nice discussion of the differences between Discrete Correlation, Normalized
	Cross-Correlation, Normalized RMS correlation, and Normalized Absolute
	Difference.
	The work takes an exploratory approach, and documents the improvements
	arising from:

\vspsml
\bullit{correlating over a sampling of the image arrays, then refining the
	match estimate using the full arrays at the point having maximum
	correlation coefficient (this is referred to as `gridding'),}
\bullit{correlating over reduced resolution depictions of the images, and then
	refining match estimates with the higher resolution depictions,}
\bullit{abstracting area characteristics (mean/variance), and using these
	more symbolic descriptions for limiting windows to be cross-correlated,}
\bullit{using known camera geometry constraints to limit search.}

	A region growing approach is taken in expanding correspondences
	outward from matched pairs (using an assumption of surface continuity).
	Various heuristics are introduced for inferring the distinctions
	between occlusions, correspondence errors, and out-of-scene overlaps.
	Hannah introduced here, through the autocorrelation function,
	a means of assessing the quality of area-based matches.
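
A minimal sketch (Python; our own reading of `gridding', not Hannah's code) of
scoring candidate positions with a sparse sampling of the window pixels and
recomputing the full correlation only at the best coarse match:

import numpy as np

def ncc(a, b):
    """Normalized cross-correlation of two equally shaped arrays."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def gridded_match(left_win, right_image, row, col_range, grid_step=3):
    """Score every candidate column using only every grid_step-th pixel of the
    windows; then evaluate the full-resolution correlation at the winner."""
    h, w = left_win.shape
    best_col, best_score = None, -1.0
    for col in range(*col_range):
        cand = right_image[row:row + h, col:col + w]
        if cand.shape != left_win.shape:
            continue
        s = ncc(left_win[::grid_step, ::grid_step], cand[::grid_step, ::grid_step])
        if s > best_score:
            best_col, best_score = col, s
    full = ncc(left_win, right_image[row:row + h, best_col:best_col + w])
    return best_col, full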

\ssheadsection{Panton System 1978}
\sayref{``A Flexible Approach to Digital Stereo Mapping,'' Panton, Dale J.,
Photogrammetric
Engineering and Remote Sensing, Vol.44, No.12, December 1978.}

	The author  describes a system for obtaining a dense digital depth
	map of smoothly rolling terrain.  The algorithm, using intensity
	cross-correlation, processes from left to right in the images, and so,
	once initialized, can use local context of previous matches and estimates
	based on the epipolar geometry to provide tight constraints on possible
	correspondences.  Maximization of a correlation coefficient in the chosen
	area selects the appropriate match. About 1\%\ of the pixels in an image
	are matched in this manner, although the entire image is used in
	determining the match correlation coefficients. Positioning accuracy
	of somewhat better than one pixel is obtained.  The system is able to tailor
	sampling window shape in one image to follow roughly the deformation of
	the rectangular window it matches in the other image. This window-shaping
	issue is one of the principal difficulties of cross-correlation analysis
	--- only in the case of flat terrain normal to the line of sight are
	corresponding windows in the two images of the same shape.  Panton's
	solution to this is to approximate the rectangular source window by a
	trapezoidal window in the other image.  The technique is based on a large
	sampling of the surrounding neighbourhood, and uses the terrain relief
	predicted by previous neighbouring correspondences to estimate the shape
	of the trapezoid about a candidate surface point.  This algorithm has been
	implemented on an experimental parallel processing machine which seems to
	achieve quite impressive performance in processing relatively
	smooth natural terrain.
	It is not clear whether or how much operator intervention is required.
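
A minimal sketch (Python; the geometry is our own simplification, not Panton's
precise construction) of resampling a right-image window as a trapezoid whose
per-row shift and scale are interpolated from locally predicted disparities:

import numpy as np

def bilinear(img, x, y):
    """Bilinearly interpolated intensity at real-valued coordinates (x, y)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * img[y0, x0] + dx * (1 - dy) * img[y0, x0 + 1] +
            (1 - dx) * dy * img[y0 + 1, x0] + dx * dy * img[y0 + 1, x0 + 1])

def trapezoidal_window(right, row, col, half_w, half_h,
                       disp_top, disp_bottom, scale_top=1.0, scale_bottom=1.0):
    """Resample a (2*half_h+1) x (2*half_w+1) window whose horizontal shift and
    width vary linearly from the top row to the bottom row, approximating the
    terrain-induced deformation of a rectangular left-image window."""
    win = np.zeros((2 * half_h + 1, 2 * half_w + 1))
    for i, dy in enumerate(range(-half_h, half_h + 1)):
        t = (dy + half_h) / (2.0 * half_h)                # 0 at top row, 1 at bottom
        shift = (1.0 - t) * disp_top + t * disp_bottom    # interpolated disparity
        scale = (1.0 - t) * scale_top + t * scale_bottom  # interpolated row width
        for j, dx in enumerate(range(-half_w, half_w + 1)):
            win[i, j] = bilinear(right, col + scale * dx + shift, row + dy)
    return win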


\penalty0
\ssheadsection{Moravec System 1980}
\sayref{``Obstacle Avoidance and Navigation in the Real World by a Seeing Robot
Rover,'' Moravec, Hans P.,
Stanford Artificial Intelligence Laboratory, AIM-340, Ph.D. thesis,
September 1980.}

	Moravec's research was aimed at providing vehicle
	control information from
	visual sensing.  His aim was not to construct a depth map, but
	rather to sample interesting points in a scene, and use these to provide
	motion calibration information and obstacle cues.  There are three main
	vision contributions in his research: the {\it interest operator}, the {\it binary
	correlator}, and {\it slider stereo}, the first two of which have been widely
	adopted by researchers in the field. The {\it interest operator} and
	{\it binary correlator} date to 1974. The {\it interest operator} is a
	filtering technique for selecting points at the center of locally maximal
	directional variance --- these are typically corners. The {\it binary correlator}
	finds the best match of a feature in one image with the intensities in the
	other image using a resolution varying technique.  Each feature (as found
	by the {\it interest operator}) is represented as a series of (5) $6\times6$
	windows, in increasing resolution (i.e.\ $6\times6$, $12\times12$, $24\times24$, $\ldots$ in the
	original image). The lowest resolution description of the feature from the
	reference image is moved a pixel at a time over the other reduced image,
	calculating correlation coefficients at each location.  The largest correlation
	coefficient is taken as indicating the best match. The next higher resolution
	window (i.e.\ next smaller window) centered on this is then searched
	(with the next higher resolution
	of the feature). This correlation process continues until a $6\times6$ patch is matched in
	the unreduced image.  The correlation has about a $10\%$ error rate. In
	{\it slider stereo}, lateral movement of a camera along a track provides 9
	equally spaced camera stations.  Correlation of the resulting 36 (9 choose
	2) possible image pairings provides a series of estimates of distances
	to scene points.
	These estimates are represented as gaussian distributions (mean equal to
	the distance estimate, and the standard deviation inversely proportional to
	the baseline) weighted by the correlation coefficient of the feature matches
	(from the binary correlator). The 36 histograms (distributions) are then
	summed, and the peak taken to indicate the correct match. Stereo tracking between
	vehicle motions is also performed with the {\it interest operator/binary
	correlator} techniques.  Here, features from the central image at the
	previous position are searched for in the central image of the current
	position, and the results of this correlation inform the system of the vehicle's
	actual movement. The positional and depth information obtained from
	these correlations provide data for the navigational control of the
	vehicle.  It knows roughly how far it has moved through the scene, and
	where its obstacles lie. Feature sampling is chosen so as to cover
	most of the scene, uniformly.
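
In our notation, the slider-stereo combination step sums, over the image
pairings, gaussian distance distributions weighted by match confidence:
$$p(z)\;=\;\sum_{i<j} c_{ij}\,\exp\!\left[-\,{(z-\hat z_{ij})^2\over 2\sigma_{ij}^2}\right],
\qquad \sigma_{ij}\propto {1\over b_{ij}},$$
where $\hat z_{ij}$ is the distance estimated from camera stations $i$ and $j$,
$b_{ij}$ is their baseline, and $c_{ij}$ is the correlation coefficient of the
feature match; the peak of $p(z)$ is taken as the distance to the feature.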

\ssheadsection{Gennery System 1980}
\sayref{``Modelling the Environment of an Exploring Vehicle by Means of Stereo Vision,'' Gennery, Donald B.,
Stanford Artificial Intelligence Laboratory, AIM-339,
Ph.D.\ thesis, June 1980.}

	Gennery's system [\gennery] was designed to provide depth data for
	vehicular autonomous navigation.  It uses cross-correlation
	to position points in space.  The system incorporates a ground plane
	finder (utilizing
	Moravec's {\it interest operator} and {\it binary correlator} [\moravec]) that
	estimates a plane in the scene above which most points lie, and uses this
	to estimate the camera relative orientations.  This derived camera
	relative orientation information enables the matching of corresponding
	windows to be constrained to a one-dimensional search.  Having estimates
	of scene noise characteristics (variance, and gain and bias between the
	two images), he defines a correlation measure that provides sub-pixel
	positioning of corresponding windows. Accompanying these are estimates of
	the confidence and accuracy of the correspondences. Since it progresses
	across an image from left to right, his algorithm can use local context of
	previous matches to suggest tentative match sites.  If these are
	inadequate for unambiguous matching of the particular window, search
	constraints based on the epipolar geometry can be used to provide further
	suggestions for the correspondence.  These begin at the {\it infinity} point
	of the corresponding epipolar line (disparity equals zero), and come
	forward (to the left, with increasing disparity) until either a suitable
	correspondence is found or some already matched windows are encountered.
	When the correct locale has been chosen, maximization of a correlation
	coefficient in a vicinity of the selected area determines the local best
	match.  This analysis is followed by a process of fitting ellipsoids to
	the determined elevation data. These, he contends, are an appropriate shape
	representation for use in obstacle avoidance calculations and scene
	matching.
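
A correlation measure of this general character (an illustrative form in our
notation, not Gennery's exact statistic) fits a gain $a$ and bias $b$ between
the two windows and scores the residual:
$$E(d)\;=\;\min_{a,b}\,\sum_{k\in W}\bigl[R(x_k+d)-a\,L(x_k)-b\bigr]^2,$$
with the disparity $d$ located to sub-pixel precision where $E(d)$, suitably
normalized by the estimated image noise variance, is smallest.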

\sheadsection{Feature-based correspondence methods}

\ssheadsection{Arnold System 1978}
\sayref{``Local Context in Matching Edges for Stereo Vision'', R.\ David Arnold,
Proceedings of the ARPA Image Understanding Workshop, Boston, p.\ 65-72, May 1978.}

	This paper describes an edge-based stereo correspondence system which
	uses edge orientation, side intensities, and edge adjacencies in determining the
	set of
	globally optimal edge matches.  Examples are shown of the processing of
	aerial views of an aircraft, cars in a parking lot and an
	apartment complex.  The Moravec interest operator and binary correlator
	[\moravec] and a high resolution correlator and camera solver [\gennery]
	are used in determining the relative orientations of the two imaging
	stations. The Hueckel operator [\hueckel] is applied to the images, producing
	a set of edge elements for the correspondence. The derived camera
	attitude information is then used to reorient the edges to a canonic frame
	--- one where the stereo baseline is along the x-axis and disparity shifts
	due to the tilt of the ground plane are cancelled. Disparities are
	restricted to
	those lying between zero (the ground) and some a priori limit in the x
	direction. A list of possible matches in the right image is obtained
	for each edge in the left image. Loose thresholds are used to specify
	the adjacency structure of
	the edges. A reinforcement/inhibition voting scheme is applied to the
	adjacency structure and match list, and the resulting maxima are chosen
	as the correct matches.

	The technique uses many heuristics and thresholds, and is said to be
	quite sensitive to the output of the Hueckel operator.
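
	A reinforcement/inhibition vote of the general flavour described (a generic
	illustration in our notation, not Arnold's actual scoring rule) updates the
	score $s(m)$ of a candidate edge match $m$ from the candidates of adjacent
	edges:
$$s^{(t+1)}(m)\;=\;s^{(t)}(m)\;+\;\alpha\sum_{m'\in A(m)} s^{(t)}(m')
\;-\;\beta\sum_{m''\in X(m)} s^{(t)}(m''),$$
	where $A(m)$ contains matches of adjacent edges at nearly the same disparity
	as $m$ (reinforcement), $X(m)$ contains matches incompatible with $m$, such as
	competing matches of the same edge (inhibition), and the maxima surviving a
	few iterations are accepted as the correct matches.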


\newpage
\ssheadsection{Control Data Corporation's Automatic Planar Surface System 1979}
\sayref{``Automatic Stereo Recognition of Man-Made Targets,'' Henderson, Robert L., Walter J.\ Miller, C.\ B.\ Grosch,
Society of Photo-Optical Instrumentation Engineers, August 1979, and
Henderson, R.\ L., ``Geometric Reference Preparation Interim Report Two: The Broken
Segment Matcher,'' RADC-TR-79-80, April 1979.}

	These papers describe the `Broken Segment Matcher' phase of the Control
	Data Corporation's stereo research system.  The aim of the work is to
	provide automatic reference preparation capabilities, the references being
	structural models of buildings which may then, at a later point, be
	used in scene recognition for autonomous guidance.  Because of this aim,
	they address their work specifically toward the problem of constructing
	planar models of rectilinear cultural scenes from aerial imagery.
	They take an interesting edge and area-based approach to the problem,
	using edge information to guide the application of their dynamic
	programming intensity correlation.
	Roughly, their algorithm functions as follows:

	\vbox{
	\vspsml\bullit{Geometrically transform a pair of images, bringing
	them into a {\it collinear} epipolar geometry (making corresponding lines
	parallel).}
	\vspsml \bullit{Locate (via a Sobel
	operator) and `thin' edges in the two images.}
	\vspsml \bullit{Establish
	edge correspondences in the first pair of epipolar lines by hand.} }
	\vbox{
	\vspsml	\bullit{Maintain two cooperating correspondence processes to minimize the
	effects of image noise and extraneous detail.  The first process matches
	intensities using only edges deemed to be `reliable', such as those seeded
	to the system
	through the manual startup.  The second process considers {\it all} edges,
	and, using the correspondences found by the first process for the
	particular line correlation it is presently performing, suggests a larger
	set of correspondences.  Those correspondences which are seen to `persist'
	over several preceding second process line analyses (implying that they
	arise from true scene geometric discontinuities) are given for
	consideration to the first process for its {\it next} line analysis.}  }

	The correlation's metric is pixel intensity difference. The two processes
	both use a least squares minimization on these intensity differences to
	choose the optimal edge correspondences.  Edges are used to bound the
	linear regions, or intervals being correlated, and edge correspondence is
	a side effect of the intensity correlation --- edges themselves are
	not compared.

	The algorithm progresses from one image epipolar line to another,
	propagating results (to limit subsequent search) as it goes.
	The algorithm, as noted in the summary, requires manual starting.
	It propagates determined correspondences along paths of proximal edges as it
	progresses from line to line.
	Constraints have been built into the system to make it applicable only
	to planar-surfaced structures, and the correlation only accepts transitions
	indicative of nearly horizontal or vertical walls $\ldots$ in fact,
	they go to substantial effort to ignore surface detail (such as roads, sidewalks,
	windows). Correspondence is determined by a least-squares optimization
	technique. The algorithm preprocesses the imagery data in a way that
	precludes it from working with anything other than straight lines (as derived
	from sequences of edges) in the images. They have processed and documented the
	analysis of a
	single scene with their algorithm.
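
	In spirit (our notation; the report's formulation is more elaborate), the
	quantity being minimized is a least-squares intensity difference over the
	piecewise-linear mapping implied by a candidate set of edge correspondences:
$$E\;=\;\sum_{\rm intervals}\;\sum_{k}\bigl[L(x_k)-R\bigl(m(x_k)\bigr)\bigr]^2,$$
	where $m(\cdot)$ maps positions within a left-image interval linearly onto
	the corresponding right-image interval; edge pairings enter only through the
	interval endpoints they define.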

	Their aim was to produce a three-dimensional planar rectilinear
	description of cultural scenes. The results shown do not indicate that
	they have succeeded.
	One point to note is that their use of two correspondence processes,
	with the second introducing `new' and removing `old' scene structures from the analysis,
	introduces a hysteresis into the processing --- new structures (in the direction
	of processing) take a while to be believed (`persist'), while old structures take a
	while to disappear once passed. Precision would seem not to have been one of the
	desired properties of their system.
	Further, a recent paper from the group comments on the
	instability and `noisy' nature of the two-process structure
	([\degrysepanton]), and explains several constraints they propose
	introducing to reduce the effects of these problems (see summary under
	[\pantongrosch]).  The
	constraints --- the scene is imaged orthographically, the structures are strictly
	rectilinear, all vertical surfaces are either parallel or orthogonal, and
	all horizontal surfaces are parallel --- are severely restrictive, and
	have no provision for the generality and flexibility a reasonable stereo
	system must have.  Once introduced into the analysis, it is difficult to
	conceive of how these restrictions could be removed for the processing of
	more general domains. The constraints they have used serve to bound the
	applicability of their process, rather than bounding its cost.
	Constraints as specialized as these should be
	applied only when justified by conclusive evidence in the scene.
	More usefully, when the view may not be of a restricted domain,
	the constraints should be derived from observations on {\it general}
	cases and should function to distinguish realizable from impossible or
	unlikely interpretations.
	These criticisms aside, however, the CDC group's
	approach was fairly comprehensive, and their use of dynamic
	programming for the optimization is a considerable contribution (dynamic
	programming for stereo correspondence was first documented in a paper by
	Gimel'farb et al.\ ([\gimel]) in 1972).

\ssheadsection{Control Data Corporation Structural Syntax Approach 1981}
\sayref{``Geometric Reference Studies,'' Panton, D.\ L., C.\ B.\ Grosch, D.\ G.\ DeGryse,
J.\ Ozils, A.\ E.\ LaBonte, S.\ B.\ Kaufmann, L.\ Kirvida;
RADC-TR-81-182, Final Technical Report, July 1981.}

	The authors, noting the inadequacies of the previous CDC stereo system
	[\hendersona, \hendersonb] and calling it `blind' to the surrounding context
	of the cultural scene, argued that they needed `techniques that are more
	akin to human picture processing where image elements are symbolically
	related and where one exploits a priori knowledge of how cultural
	structures are constructed.'  They designed a `structural syntax $\ldots$
	to fulfill this geometric need.'  The structural syntax is introduced as a
	set of geometric principles specific to the sort of 3-d cultural scene
	their research addressed.

	This work is concerned with the processing of urban structures
	stereoscopically projected either orthographically or centrally to planar
	imaging surfaces.  Applications in the report were restricted to
	structures in the form of right parallelepipeds.  Specifically, symbolic
	matching was applied only to roofs, and these were restricted to
	horizontal and rectangular, L-shaped, or T-shaped figures.  A substantial
	degree of operator intervention was required in the processing.

	Three geometric principles were introduced as `structural syntax':

	1) The edge orientation principle:  Use of convergence of 3-space parallel lines
	to vanishing points, for clustering parallel (opposing) edges (utilization
	of vanishing points has also been suggested in [\liebes]).

	2) The principle of known transform slope:  Governs allowable 3-space
	orientations of edges, constraining surfaces to be either vertical or
	horizontal.

	3) The min-max transform principle:  Governs the range of acceptable
	heights for structures.

	Currently, vanishing points are determined manually.  The authors appear to
	assume that the alignment of buildings is given, so here the syntax is
	used to restrict the nature and orientation of walls and roofs.
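
	For reference (standard projective geometry, not notation from the report):
	under central projection with interior calibration $K$ and camera rotation
	$R$, all 3-space lines with direction ${\bf d}$ image to lines through the
	common vanishing point
$$v\;\simeq\;K\,R\,{\bf d}$$
	(in homogeneous image coordinates), so clustering edges by the vanishing
	point toward which they converge groups the families of parallel edges that
	the first principle exploits.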

	Testing was done on real imagery from a small portion of one medium-sized
	building. Edges selected were quite conspicuous, and were oriented in only
	one direction.  (Images are {\it collinearized} to have epipolar lines
	parallel before stereo processing is begun.)


	Two complementary algorithms were developed for region matching.  Both
	centered on roof identification, under the assumption that roofs were less
	likely to be occluded or contain shadows.  Roofs are assumed parallel to
	the ground. Walls are `dropped' to intersect the ground after roofs are
	identified.  Both systems use edge files to construct roof regions.  One
	of the two algorithms `constructs regions individually and separately in
	each image', and then performs the matching process between the images.
	The other first matches edges in the two images, using 3-space height as a
	constraint, and then constructs regions from stereo pairs of lines.

	ALGORITHM 1: FIGURE MATCHING --- Polygonal figures are found in the two
	images.  Symbolic stereo matching is guided by syntax directed corner
	matching.  Syntactic principles guide completion of incomplete line
	figures.

	Tests have been performed on manually produced line files, using real
	camera model transformations.  The files corresponded to horizontal
	surfaces.  Matching proceeded satisfactorily, and incomplete figures were
	correctly completed.

	ALGORITHM 2: LINE SEGMENT MATCHING --- This procedure involves pair-wise
	matching of lines from stereo images. Consistency metrics are introduced
	to measure the validity of matches. Application has been made to
	rectangular and tee-shaped roofs.  After rectangles and tees have been
	formed, they are compared, and `line-sharing-conflict' criteria are
	applied to choose a final set of building tops.


	The authors acknowledge `that extensive testing and technique development
	is still necessary within all areas.'  The present capability appears to
	require a substantial degree of skilled operator intervention involving
	iterative tuning and repeated passes through the process.

\ssheadsection{Barnard and Thompson 1979}
\sayref{``Disparity Analysis of Images,'' Stephen T.\ Barnard and William B.\ Thompson,
Computer Science Department, University of Minnesota, Jan.\ 1979.}

	This paper presents an algorithm for matching points across two images,
	with the intent of either determining three-space position (with camera
	motion) or spatial velocities (for object motion).
	Points are chosen, tentatively matched, and then iteratively corrected
	through the assumption of scene continuity.
	Point selection uses the Moravec interest operator and a $5\times5$
	correlation window to select locally maximal intensity variance locales.
	A threshold is used to choose significant variations.
	The correction process presumes that scene depth is continuous,
	and does a relaxation-type iteration to improve the confidence in similarity
	estimates. These similarity estimates are based on the sum of squared
	differences over the window ($5\times5$).
	Estimates are iterated through an adjustment procedure 10 times, and the
	highest surviving probabilities are taken as indicating the correct
	correspondences (if probability is above 0.7).
	This is a very simple correspondence procedure.  It cannot use camera
	attitude information to guide or limit the search, but for the same reason it
	can be used for motion tracking as well as stereopsis.
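
	An update of the general flavour used in such relaxation schemes (a generic
	illustration in our notation, not necessarily the authors' exact rule) is
$$P^{(t+1)}_i(d)\;\propto\;P^{(t)}_i(d)\,\Bigl[A\;+\;B\sum_{j\in N(i)}\,
\sum_{|d'-d|\le 1}P^{(t)}_j(d')\Bigr],$$
	where $P_i(d)$ is the current probability that point $i$ has disparity $d$,
	$N(i)$ are its neighbouring candidate points, and the probabilities are
	renormalized over each point's candidate disparities after every iteration.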

\penalty0
\ssheadsection{Marr-Poggio System (Grimson) 1980}
\sayref{``Computing Shape Using a Theory of Human Stereo Vision,'' Grimson, W.\ E.\ L.,
Department of Mathematics, MIT, Ph.D.\ thesis, June 1980.}

The approach of the MIT group melds psychological theory and
observations into a computational algorithm for stereo vision.
They consider neurophysiological relevance and biological feasibility crucial
aspects of their algorithm, and support the details of their approach with
extensive references to the perceptual psychology literature.
The algorithm, developed basically by David Marr and Tomaso Poggio [\marrpoggiob],
is an edge-based line-oriented filtering and matching process.
Grimson's implementation of the stereopsis algorithm [\grimson] processes as follows:

\vbox{
\vspsml
\bullit{Fill 4 pairs of working arrays with zero-crossing values and
orientations. The zero-crossings are found by convolving the
images with 4 spatial frequency tuned band-pass filters, varying in size from 7
to 63 pixels in width.}}

\vbox{
\vspssml
\bullit{Set initial vergence values for the eyes in the two images (manually).}
\vspssml
\bullit{Match zero-crossings in the paired arrays with these relative
eye positions. Within paired arrays, the process decides upon acceptable matches
on the basis of zero-crossing {\it contrast} (positive or negative) and very rough
edge orientation estimates (quantized to 30 degrees, so slopes must be within
approximately 60 degrees of each other). Matches are of {\it positive},
{\it negative}, and {\it zero} disparity, relative to the vergence.}
\vspssml
\bullit{Mark ambiguous or `no-match' edges as such.}
\vspssml
\bullit{Check the proportion of unmatched points in {\it regions}, and for those
regions where this proportion is greater than 30\%, delete all matches. Regions are defined
with regard to some statistical measure to ensure that the size represents a reasonable
local sample.}
\vspssml
\bullit{On the basis of low frequency filter matchings, make various {\it positive}
and {\it negative} vergence movements to bring unmatched high frequency edges
into correspondence (high frequency edges come from the smallest filters),
and iterate on the matching process.}}
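
The band-pass filters of the first step are, in the Marr-Poggio theory,
Laplacian-of-Gaussian masks (approximated in practice by differences of
gaussians); up to a normalizing constant (our notation),
$$\nabla^2 G(x,y)\;\propto\;\left({r^2\over 2\sigma^2}-1\right)
\exp\!\left(-\,{r^2\over 2\sigma^2}\right),\qquad r^2=x^2+y^2,$$
with the four channel sizes obtained by varying $\sigma$; the zero-crossings
of the filtered images are the elements matched.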

A subsequent process interpolates a smooth surface to this derived edge-based
disparity data, resulting in a full depth map.  The assumption that allows
interpolation to take place is that `no information is information', i.e.\ that
the lack of edge signal in a part of the scene indicates that there are no
intensity discontinuities there, and thus likely no depth discontinuities.
This is a valid assumption, although rather dismissive of useful intensity
information. Furthermore, the interpolation scheme makes no distinction
between surface boundaries (depth boundaries) and surface details $\ldots$
the former should likely be breakpoints for the interpolation, rather than
knots.  The resulting surface fitting smooths over the entire scene.
Elegant as the interpolatory analysis is, a true solution to the problem
of defining inter-edge surface shape would need to consider the global context
{\it (`is there any indication that this is an occluding edge?')} and, where
possible, domain knowledge {\it (`does this seem to be the top of a building?')}.

The results published include the analysis of several
random dot stereograms, each composed of $4\times4$ randomly positioned
black and white squares, with the maximum vergence variation running
from 2 to 6 dot widths.  Other examples include a ground level
building scene, a view from a Mars Viking vehicle, and a random dotted
coffee jar.

Assessment of the algorithm is a bit difficult: it uses a fairly simple
control structure with unsophisticated matching criteria, and the success
it achieves with these mechanisms is quite remarkable. But questions arise.  The
approach lacks a mechanism for assessing {\bf global consistency} in its
correspondence results.
It would seem from the discussion of the algorithm that the initial eye
vergence plays an important role in determining the final set of
correspondences. By accepting high frequency channel correspondences on a
{\it local} basis the implementation precludes other vergence matchings
which could be {\it globally} more satisfactory (it should be noted that
lower spatial frequency is {\bf not} synonymous with {\it globality}
--- see [\juleszglobal]).  Notice also that the low-frequency to high-frequency
control structure said to be used here is shown in
[\frisbymayhew] to be inadequate as a model for human stereopsis.  Using
a maximum filter size that corresponds to the largest observed in foveal
vision only (the implementation doesn't vary filter size with
eccentricity, as the theory suggests), Grimson has excluded from his
processing the possibility of the more globally-driven radical vergence
movements that seem necessary for scenes having large disparity
variations.  Perhaps this would be recoverable through the correct
implementation, with filter size varying with eccentricity $\ldots$ he has
only implemented the theory for foveal vision. {\bf Monocular cues}, which
their theory doesn't address, are known to provide information for such
radical vergence movements ([\sayefrisby]). Initial vergence is set
manually; it is not clear how subsequent major vergence adjustments are
controlled. In fact, several control strategies are experimented with in
the text, each chosen to give the optimal results for the channel noise settings
being tested.  No clear definition of vergence control is given.  In the
light of the chronic failure in past vision research to document
limitations and test to the breaking point, it may seem rather unfair to
bring criticism to an apparently successful algorithm such as this, but its
completeness has yet to be demonstrated (an interesting recent extension of the
Marr-Poggio theory of stereopsis that addresses some of these issues is
described in [\mayhewfrisbyaij]).


\ssheadsection{Baker System 1981}
\sayref{``Depth from Edge and Intensity Based Stereo,'' Baker, H.\ Harlyn,
University of Illinois, Ph.D.\ thesis, September 1981.}

	The depth determination system described in this thesis is founded
	on an edge-based line-by-line stereo correspondence scheme.
	Its processing consists of:
\vbox{
\vspsml \bullit{extracting edge
	descriptions of a pair of images (with a simple second difference
	zero-crossing edge operator),}
\vspsml \bullit{linking these edges to their nearest
	neighbors to obtain the connectivity structure of the images,}
\vspsml \bullit{correlating the edge descriptions on the basis of local edge
	measures, and}
\vspsml \bullit{cooperatively removing those edge correspondences formed by the
	correlation which violate the connectivity structure of the two images.}}

	Two further correspondence processes fill in surface detail missed by this
	edge mapping/removal scheme.  The first is another edge-based matching
	that is restricted to the intervals defined by the correspondences
	found earlier. The intention here is to pair edges that were
	either unmatched or found too rivalrous by the total line matching algorithm,
	but are readily paired within the tighter context of corresponding intervals
	(although Baker mentions that this phase of the correlation is still in
	development). The final correspondence process is one which
	uses a technique similar to that used for the edges
	but operating on the image intensity values within corresponding intervals.
	The result of the processing is a full
	image array disparity map of the scene. Dynamic programming, in the
	form of the Viterbi algorithm, is used for all (there are four)
	correspondence processes.

	The system incorporates both an edge-based and an intensity-based analysis
	in its processing. Edges are matched in a near-optimal line-by-line
	dynamic programming correlation.  This correlation allows edges to
	remain unpaired (from which they may be considered either occluded or spurious).
	Edges in reduced resolution versions of the images are
	first correlated to provide rough estimates of the scene correspondences.
	These then constrain the disparity values possible for edge matchings in
	the full resolution images. Properties of the edges used for the matching
	are contrast, intensities to the sides, orientation in the image planes, and
	proximity to nearest (correlatable) edges along the epipolar line.
	The use of the last two parameters is based on an
	analysis of Arnold and Binford ([\arnbinf]).  The dynamic programming algorithm maximizes
	the decision metric, giving the best (local to the line) interpretation of
	the edges as part of a realizable scene.  Two-dimensional connectivity of
	the edges is used to remove correspondences inconsistent with a single
	global interpretation.  Remaining matches form a kernel of good
	correspondences from which the further edge-based and then intensity-based
	dynamic programming correlations complete the image matchings up to the point
	of a full disparity map.
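
	A minimal sketch (Python; a generic order-preserving dynamic programming
	matcher of our own construction, not Baker's actual cost model) of pairing
	two sequences of edges along an epipolar line while permitting unpaired
	edges at a fixed penalty:

import numpy as np

def match_edge_sequences(left, right, score, skip_penalty=1.0):
    """Best order-preserving pairing of two edge sequences.  score(a, b) is a
    similarity (higher is better); either edge may remain unmatched."""
    n, m = len(left), len(right)
    D = np.full((n + 1, m + 1), -np.inf)      # best score over matched prefixes
    D[0, 0] = 0.0
    back = {}                                 # traceback pointers
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and D[i - 1, j] - skip_penalty > D[i, j]:
                D[i, j] = D[i - 1, j] - skip_penalty      # leave left[i-1] unmatched
                back[i, j] = (i - 1, j)
            if j > 0 and D[i, j - 1] - skip_penalty > D[i, j]:
                D[i, j] = D[i, j - 1] - skip_penalty      # leave right[j-1] unmatched
                back[i, j] = (i, j - 1)
            if i > 0 and j > 0:
                s = D[i - 1, j - 1] + score(left[i - 1], right[j - 1])
                if s > D[i, j]:
                    D[i, j] = s                           # pair left[i-1] with right[j-1]
                    back[i, j] = (i - 1, j - 1)
    pairs, cell = [], (n, m)
    while cell != (0, 0):                     # recover the optimal pairing
        prev = back[cell]
        if prev == (cell[0] - 1, cell[1] - 1):
            pairs.append(prev)
        cell = prev
    return D[n, m], pairs[::-1]

	Baker's system applies this style of optimization (via the Viterbi algorithm)
	at reduced and then full resolution, to edges and subsequently to intensities
	within corresponding intervals.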

	Its exploitation of line-by-line redundancy and 2-D continuity for global
	consensus frees this system from the need for either manual initialization
	(as [\panton], [\hendersona], [\grimson]), or sequential line processing (as
	[\levine], [\panton], [\hendersona], [\gennery]), thus avoiding
	the problems of predictive matching where miscorrelations can lead to
	deviation from the correct surface solution.
	This is the first stereo system to make explicit use of
	continuity along extended contours in disambiguating correspondence
	rivalries, and it uses a considerably more sophisticated matching metric
	in its correspondences than any of the above systems.
	One noticeable shortcoming is that the correlation performed on
	intensity values needs improvement; although its disparity context is
	quite well defined by the `kernel' of good edge correspondences, the pixel
	intensity correspondence is subject to line-by-line aberrations.
	Smoothness in disparities along the line is controlled by the
	optimization, but no such constraints exist for {\it between} line disparities
	-- a better, two-dimensional fitting algorithm is needed.  Results show
	the processing of two quite disparate scenes, but demonstration on more
	imagery is necessary to convince one of the algorithm's generality.


\ssheadsection{Arnold System 1982}
\sayref{``Automated Stereo Perception,'' R.\ David Arnold, Department of Computer Science, Stanford
University, forthcoming Ph.D.\ thesis, 1982.}

	This thesis deals with feature-based stereo correspondence.
	The use of epipolar geometry makes the matching a
	one-dimensional problem. Edge continuity is utilized in combining
	matches along adjacent epipolar lines to produce a match over an entire
	image.

	An important component of the thesis is the derivation of two specific
	analytic functions. These functions are based on geometric constraints for
	the matching of edges in the one-dimensional stereo correspondence problem.
	One concerns constraints on the {\it intervals}
	between adjacent edges and the other concerns constraints on the {\it
	angles} of matched edges.  These results allow a distribution function in
	the object space to be translated to a distribution function in the image
	space. The functions allow probability estimates to be made on the
	likelihood that edges from the two images correspond, and can be used
	in the selection of the best pairing among a set of alternatives.  The analysis
	suggests that the functions are sharply peaked even for the 60 degree
	vergence angles used in aerial photography.  When angles corresponding to
	human vision are used, the conditions are extremely strong. The author has
	modified a dynamic programming algorithm to incorporate these matching
	constraints with an interpretation mechanism that requires an explanation
	for each occluded edge or surface (interval).

	A globally optimal match may be sub-optimal when viewed from the narrow
	context of a single epipolar line.  Edge continuity across epipolar lines
	is used to take the one-dimensional matchings into a globally consistent
	two-dimensional matching.  This is achieved through an extension of the
	Viterbi algorithm which produces for each line matching a list of all
	matches scoring within a preselected range of the optimal match. This list
	is filtered by an iterative process which enforces consistency among
	adjacent epipolar lines.

	The system assumes an input of linked edges; that is, extended edges
	rather than edge elements (in fact, the current implementation is based on
	straight line segments).  Input files were prepared by hand through
	computer-assisted tracing of real imagery.  Effort was made to reflect the
	imperfection of real data. Constraints based on image intensities are
	used only very weakly. The output of the system is a three-dimensional
	map of edges in the scene.

	The principal contributions of this thesis are the two probability functions
	and the modification to the Viterbi algorithm enabling sub-optimal paths to
	be computed and retained for later disambiguation.  It relies on clean
	edge data (in the dynamic programming optimization every edge must be
	explained, yet none can be considered spurious), and manual processing for
	the connectivity, orientation and position of edges, and the intensity values
	of surfaces. These
	restrictions make it difficult to judge the system's potential in real
	imagery, automatically processed. The contributions will doubtless
	be used in future stereo correspondence processes ([\bakerthesis] exploits
	the probabilistic interval and edge orientation measures derived here).

\newpage